LLM Advanced: Fine-Tuning and Deployment

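Overview

In the first article, we covered the basic concepts of LLMs and how to use them. Pretrained models are already highly capable, but in specific scenarios we may need the model to adapt to a particular task, style, or body of domain knowledge. This article walks through the main approaches to LLM fine-tuning and several local deployment options, to help you build a model customized to your needs.

What Is Model Fine-Tuning

The Concept of Fine-Tuning

Fine-tuning continues training a pretrained model on domain-specific data so that it adapts better to a specific task.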

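text

Pretrained model (general capability)
        │
        │ fine-tuning
        ▼
┌─────────────────────────────────────────────────────────┐
│           Fine-tuned model (domain-specific)            │
│                                                         │
│  General chat model        →  Customer support model    │
│  General code model        →  In-house codebase model   │
│  General translation model →  Industry translation model│
│  General writing model     →  House-style writing model │
└─────────────────────────────────────────────────────────┘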

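When to Fine-Tune

Scenario	Fine-tuning candidate?	Lighter alternative to try first
General knowledge	❌	Use the base model directly
Domain-specific terminology	✅	Add a domain glossary / RAG
Specific output format	✅	Prompt engineering
Specific style or tone	✅	Few-shot examples
Private enterprise data	✅	RAG
Code generation (private framework)	✅	Provide documentation examples
Cost-sensitive	⚠️	Small model + RAG

Recommendation: try prompt engineering and RAG first, and turn to fine-tuning only when they clearly fall short.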

Fine-Tuning Methods in Detail

Method Comparison at a Glance

text

┌─────────────────────────────────────────────────────────┐
│              Fine-tuning methods compared               │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Full Fine-tuning                                       │
│  ┌─────────────────────────────────────────────────┐    │
│  │ Updates all weights │ high VRAM │ slow │ best   │    │
│  └─────────────────────────────────────────────────┘    │
│                                                         │
│  LoRA (Low-Rank Adaptation)                             │
│  ┌─────────────────────────────────────────────────┐    │
│  │ Frozen base │ small adapters │ low VRAM │ good  │    │
│  └─────────────────────────────────────────────────┘    │
│                                                         │
│  QLoRA (Quantized LoRA)                                 │
│  ┌─────────────────────────────────────────────────┐    │
│  │ 4-bit base │ LoRA │ lowest VRAM │ recommended   │    │
│  └─────────────────────────────────────────────────┘    │
│                                                         │
│  Prompt Tuning                                          │
│  ┌─────────────────────────────────────────────────┐    │
│  │ Prompt only │ few params │ limited │ quick try  │    │
│  └─────────────────────────────────────────────────┘    │
│                                                         │
└─────────────────────────────────────────────────────────┘

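Full Fine-tuning

Full fine-tuning updates every parameter of the model. It generally gives the best results but has the highest resource requirements.

text

Full fine-tuning flow:

Pretrained model (70B parameters)
        │
        ▼
┌─────────────────────────────────────────────────────────┐
│  Training                                               │
│                                                         │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐                │
│  │ Layer 1 │   │ Layer 2 │   │ Layer N │                │
│  │ updated │   │ updated │   │ updated │                │
│  └─────────┘   └─────────┘   └─────────┘                │
│                                                         │
│  VRAM needed: ~700GB (70B model)                        │
│  Training time: can run to weeks or months              │
│                                                         │
└─────────────────────────────────────────────────────────┘
        │
        ▼
   Fine-tuned model

Resource requirements (rough estimates; actual usage depends on the optimizer, numeric precision, sequence length, and batch size):

Model size	VRAM needed	Suggested hardware
7B	~80GB	1×A100 (80GB)
13B	~150GB	2×A100 (80GB)
70B	~700GB	8×A100 (80GB) plus offloading, or more GPUs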

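LoRA (Low-Rank Adaptation)

LoRA is one of the most widely used fine-tuning methods. It freezes the pretrained weights and trains a pair of small low-rank matrices whose product is added to them.

text

LoRA principle:

Original weight matrix W (d×d)
        │
        │ W + ΔW = W + BA
        │              │
        │              └─ B (d×r) × A (r×d)
        │                  r << d (the rank is small)
        ▼

┌─────────────────────────────────────────────────────────┐
│  Advantages:                                            │
│  • Trains only about 1% of the parameters               │
│  • Cuts VRAM needs by about 75%                         │
│  • Adapters can be swapped in and out easily            │
│  • Quality close to full fine-tuning                    │
└─────────────────────────────────────────────────────────┘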

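Example LoRA configuration:

yaml

# LoRA hyperparameters
lora_r: 16           # rank; 8, 16, and 32 are common choices
lora_alpha: 32       # the update BA is scaled by alpha / r (2.0 here)
lora_dropout: 0.05   # dropout on the adapter path, to reduce overfitting
target_modules:      # modules that receive adapters
  - q_proj
  - v_proj
  - k_proj
  - o_proj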
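
The low-rank update and the alpha / r scaling can be tried out on a toy matrix. The sketch below is a minimal illustration in plain Python, not how any library implements LoRA: the helper names (`matmul`, `lora_delta`, `lora_param_counts`) and the toy sizes are made up for this example, and B starts at zero as in the common LoRA initialization, so the adapted weights initially equal the original ones.

```python
import random

def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_delta(B, A, alpha, r):
    """Return the scaled low-rank update (alpha / r) * B @ A."""
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]

def lora_param_counts(d, r):
    """Parameters in a full d x d matrix vs. its two LoRA factors."""
    full = d * d
    lora = d * r + r * d
    return full, lora

# Toy sizes so the example runs instantly (hypothetical, not from a real model)
d, r, alpha = 8, 2, 4
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
A = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # r x d, small random values
B = [[0.0] * r for _ in range(d)]                               # d x r, zero-initialized

delta = lora_delta(B, A, alpha, r)
W_adapted = [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# With B at zero, the adapted weights equal the original weights
print(W_adapted == W)

# Parameter count for a single 4096 x 4096 projection at rank 16
full, lora = lora_param_counts(4096, 16)
print(full, lora, f"{lora / full:.2%}")
```

For one 4096×4096 projection at rank 16, the two factors hold 131,072 parameters against 16,777,216 in the full matrix, which is where the "about 1% or less" figure comes from.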

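QLoRA (Quantized LoRA)

QLoRA builds on LoRA by quantizing the frozen base model to 4 bits, which lowers VRAM requirements further.

text

QLoRA flow:

Pretrained model
        │
        ▼ quantize to 4-bit
┌─────────────────────────────────────────────────────────┐
│  Quantized model                                        │
│  • A 70B model needs about 40GB (FP16 needs 140GB)      │
│  • NF4 quantization preserves accuracy                  │
│  • Double quantization compresses the scale constants   │
└─────────────────────────────────────────────────────────┘
        │
        ▼ LoRA fine-tuning
   Fine-tuned model (LoRA adapter)

Resource comparison:

Method	7B model	13B model	70B model
Full fine-tuning	~80GB	~150GB	~700GB
LoRA	~20GB	~40GB	~200GB
QLoRA	~10GB	~20GB	~48GB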

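Choosing a Fine-Tuning Method

text

┌─────────────────────────────────────────────────────────┐
│           How to choose a fine-tuning method            │
└─────────────────────────────────────────────────────────┘

VRAM >= 48GB (enough for 70B QLoRA)?
├─ Yes → consider QLoRA (recommended) or LoRA
└─ No  → consider a cloud service or a smaller model

Do you need the best possible quality?
├─ Yes → full fine-tuning (if resources allow)
└─ No  → LoRA/QLoRA (better value)

Do you need multiple adapters?
├─ Yes → LoRA (easy to switch)
└─ No  → QLoRA (VRAM-friendly)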

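Preparing Training Data

Dataset Formats

Fine-tuning data is usually stored as JSONL, with one JSON object per line. The records below are pretty-printed for readability, and the // labels are annotations rather than part of the file:

json

// Instruction format
{
  "instruction": "Explain what quantum computing is",
  "input": "",
  "output": "Quantum computing uses the principles of quantum mechanics..."
}

// Conversation format
{
  "messages": [
    { "role": "system", "content": "You are a Python expert" },
    { "role": "user", "content": "How do I use a list comprehension?" },
    { "role": "assistant", "content": "A list comprehension is a Python construct that..." },
    { "role": "user", "content": "Can you give an example?" },
    { "role": "assistant", "content": "Sure! For example..." }
  ]
}

// Completion format
{
  "prompt": "def fibonacci(n):",
  "completion": "    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)"
}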
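
Tools differ in which of these formats they expect, so converting between them is a common chore. As a minimal sketch, the snippet below turns an instruction-format record into the conversation format and serializes it as one JSONL line. The function names and the `system_prompt` parameter are hypothetical, and joining `instruction` and `input` with a blank line is just one reasonable convention.

```python
import json

def instruction_to_messages(record, system_prompt=None):
    """Convert an instruction-format record into the conversation ("messages") format."""
    user_text = record["instruction"]
    if record.get("input"):
        # Append the optional input field below the instruction
        user_text += "\n\n" + record["input"]
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": record["output"]})
    return {"messages": messages}

def to_jsonl_line(record):
    """Serialize one record as a single line; json.dumps escapes any newlines inside strings."""
    return json.dumps(record, ensure_ascii=False)

example = {
    "instruction": "Explain what quantum computing is",
    "input": "",
    "output": "Quantum computing uses the principles of quantum mechanics...",
}
line = to_jsonl_line(instruction_to_messages(example))
print(line)
```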

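Recommended Dataset Size

Task type	Minimum samples	Recommended samples	Quality requirement
Style transfer	100-500	1000+	High
Format adjustment	50-200	500+	High
Domain adaptation	500-2000	5000+	Medium
Code generation	1000+	10000+	High

Note: data quality matters more than quantity. 1,000 high-quality samples usually beat 10,000 low-quality ones.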

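Data Collection Strategies

python

import hashlib
import re

# 1. Organize existing data
# - Internal documents
# - Customer-service conversation logs
# - Code repositories
# - Product manuals

# 2. Create data by hand
# - Q&A pairs written by experts
# - Annotated datasets
# - Corrected model outputs

# 3. Augment the data
# - Paraphrasing
# - Expanding with examples
# - Generating counterexamples

# 4. Clean the data
# Remove exact duplicates
def deduplicate(data):
    seen = set()
    unique = []
    for item in data:
        # Key on a digest of the text, which is stable across runs
        digest = hashlib.sha256(item['text'].encode('utf-8')).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(item)
    return unique

# Mask personal information
def remove_pii(text):
    # Phone numbers written as 3-4-4 digit groups, e.g. 138-0000-0000
    text = re.sub(r'\d{3}-\d{4}-\d{4}', '[PHONE]', text)
    # Email addresses
    text = re.sub(r'\S+@\S+', '[EMAIL]', text)
    return text

# Quality filter
def filter_quality(data):
    # Only the length check is implemented here; filters for repeated
    # or meaningless content would be added in the same place
    return [item for item in data if len(item['text']) > 50]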

Splitting the Dataset

python

import json

from sklearn.model_selection import train_test_split

# Split into training and validation sets (all_data is the cleaned list of records)
train_data, val_data = train_test_split(
    all_data,
    test_size=0.1,      # hold out 10% for validation
    random_state=42     # fixed seed for a reproducible split
)

# Save each split as JSONL, one record per line
def save_jsonl(data, path):
    with open(path, 'w', encoding='utf-8') as f:
        for item in data:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')

save_jsonl(train_data, 'train.jsonl')
save_jsonl(val_data, 'validation.jsonl')
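
Before training, it can help to read the files back and check every line. The sketch below is one way to do that with the standard library. `load_jsonl` and `REQUIRED_KEYS` are names invented for this example, and the required keys assume the instruction format shown earlier.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"instruction", "output"}  # assumed schema: the instruction format

def load_jsonl(path):
    """Read a JSONL file and return (records, errors); errors carry 1-based line numbers."""
    records, errors = [], []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                item = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            # A non-object line (e.g. a bare list) counts as missing every key
            missing = REQUIRED_KEYS - set(item) if isinstance(item, dict) else REQUIRED_KEYS
            if missing:
                errors.append((lineno, f"missing keys: {sorted(missing)}"))
                continue
            records.append(item)
    return records, errors

# Round-trip demo on a temporary file with one good line and two bad ones
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"instruction": "Say hi", "output": "Hi!"}\n')
        f.write('{"instruction": "No output field"}\n')
        f.write('not json\n')
    records, errors = load_jsonl(path)

print(len(records), errors)
```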

Fine-Tuning in Practice

Using Hugging Face Transformers

bash

# Install dependencies
pip install transformers datasets peft accelerate bitsandbytes

python

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset

# 1. Load the tokenizer
model_name = "Qwen/Qwen2.5-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    # Fall back to EOS only when the tokenizer defines no padding token
    tokenizer.pad_token = tokenizer.eos_token

# 2. Load the base model in 4-bit for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for matrix multiplications
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

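# 3. Prepare the quantized model for training, then attach LoRA adapters
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                     # rank
    lora_alpha=32,            # scaling numerator; the update is scaled by alpha / r
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows trainable vs. total parameter counts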

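# 4. Load the dataset
dataset = load_dataset('json', data_files={
    'train': 'train.jsonl',
    'validation': 'validation.jsonl'
})

# 5. Preprocess: tokenize instruction and response as one sequence
def preprocess_function(examples):
    texts = [f"Instruction: {i}\nResponse: {o}{tokenizer.eos_token}"
             for i, o in zip(examples['instruction'], examples['output'])]

    model_inputs = tokenizer(
        texts,
        max_length=512,
        truncation=True,
        padding="max_length"
    )

    # For causal LM training the labels are the input ids themselves (the model
    # shifts them internally); padded positions get -100 so the loss skips them
    model_inputs["labels"] = [
        [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
        for ids, attn in zip(model_inputs["input_ids"], model_inputs["attention_mask"])
    ]
    return model_inputs

tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names  # keep only the tokenized fields
)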

# 6. Training configuration
training_args = TrainingArguments(
    output_dir="./qwen-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size per device: 4 × 4 = 16
    warmup_steps=100,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_steps=100,
    eval_strategy="steps",          # named evaluation_strategy in older transformers releases
    eval_steps=100,
    save_total_limit=2,
)

# 7. Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
)

# 8. Start training
trainer.train()

# 9. Save the LoRA adapter weights and the tokenizer
model.save_pretrained("./my-finetuned-model")
tokenizer.save_pretrained("./my-finetuned-model")
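
A common refinement in instruction tuning is to compute the loss only on the response tokens, by setting the labels of prompt and padding positions to -100, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below shows the bookkeeping on plain lists of token ids. `build_labels` is a hypothetical helper, not part of Transformers, and the ids are made up.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def build_labels(prompt_ids, response_ids, max_length, pad_id):
    """Concatenate prompt and response ids, masking prompt and padding in the labels."""
    input_ids = (prompt_ids + response_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_length]
    attention_mask = [1] * len(input_ids)

    # Right-pad all three sequences to max_length
    pad = max_length - len(input_ids)
    input_ids += [pad_id] * pad
    labels += [IGNORE_INDEX] * pad
    attention_mask += [0] * pad
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Made-up token ids, for illustration only
sample = build_labels(prompt_ids=[11, 12, 13], response_ids=[21, 22], max_length=8, pad_id=0)
print(sample["input_ids"])       # [11, 12, 13, 21, 22, 0, 0, 0]
print(sample["labels"])          # [-100, -100, -100, 21, 22, -100, -100, -100]
print(sample["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Only positions 3 and 4 contribute to the loss here, so the model is trained to produce the response rather than to reproduce the prompt.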

Fine-Tuning with LLaMA-Factory

bash

# Install LLaMA-Factory from source
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

# Launch the Web UI
llamafactory-cli webui

yaml

# config.yaml
### Model
model_name_or_path: Qwen/Qwen2.5-7B-Instruct

### Method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj
lora_rank: 16
lora_alpha: 32

### Dataset (my_dataset must be registered in data/dataset_info.json)
dataset: my_dataset
dataset_dir: data
template: qwen
cutoff_len: 1024

### Output
output_dir: saves/qwen-7b-lora
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### Training
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 5.0e-05
num_train_epochs: 3.0
lr_scheduler_type: cosine
fp16: true

bash

# Start training
llamafactory-cli train config.yaml
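
The batch-size settings in the config above determine how many optimizer steps a run takes, which matters when picking values like `save_steps`. The sketch below is an approximation: `training_steps` is a hypothetical helper, the 5,000-sample dataset size is invented, and real trainers differ in how they round the last partial batch.

```python
import math

def training_steps(num_samples, per_device_batch_size, grad_accum_steps, num_epochs, num_devices=1):
    """Return (effective batch size, optimizer steps per epoch, total optimizer steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * num_epochs

# 5,000 samples is a hypothetical dataset size; the other values mirror the config above
print(training_steps(num_samples=5000, per_device_batch_size=4, grad_accum_steps=4, num_epochs=3))
# → (16, 313, 939)
```

With roughly 939 steps in total, `save_steps: 500` would write a single intermediate checkpoint on one GPU.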

Fine-Tuning with Axolotl

bash

# Install Axolotl
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl
pip install -e .

# Write the config file (key names follow Axolotl's example configs;
# check them against the version you install)
cat <<EOF > my_config.yml
base_model: Qwen/Qwen2.5-7B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: true
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj

datasets:
  - path: train.jsonl
    ds_type: json
    type: alpaca

gradient_accumulation_steps: 4
micro_batch_size: 4
num_epochs: 3
learning_rate: 0.0002
fp16: true

output_dir: ./output
EOF

# Start training
accelerate launch -m axolotl.cli.train my_config.yml

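Model Quantization and Deployment

What Is Model Quantization

Quantization converts model parameters from high precision (FP32/FP16) to low precision (INT8/INT4), which reduces memory use and usually speeds up inference.

text

Precision comparison:

FP32 (32-bit float)    →  4 bytes per parameter    →  70B model ≈ 280GB
FP16 (16-bit float)    →  2 bytes per parameter    →  70B model ≈ 140GB
INT8 (8-bit integer)   →  1 byte per parameter     →  70B model ≈ 70GB
INT4 (4-bit integer)   →  0.5 bytes per parameter  →  70B model ≈ 35GB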
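
The figures above are simply parameter count times bytes per parameter. As a small sketch (the function name is made up, and the result covers the weights only, not activations or the KV cache):

```python
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(num_params, precision):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(70e9, precision):.0f} GB")
```

Swapping in `7e9` or `13e9` gives the matching numbers for smaller models.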

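Quantization Methods

Method	Accuracy loss	Memory saved (vs. FP32)	Recommended scenario
FP16	Negligible	50%	Server deployment
INT8	Very small	75%	Edge devices
INT4 (GPTQ)	Small	87.5%	Local deployment
INT4 (AWQ)	Small	87.5%	Local deployment
INT4 (NF4)	Very small	87.5%	QLoRA training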
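
To see where the accuracy loss comes from, here is a deliberately simplified round trip using symmetric absmax INT8 quantization with a single scale. This is not GPTQ, AWQ, or NF4, which use more elaborate schemes; the function names and sample weights are made up for illustration.

```python
def quantize_absmax_int8(values):
    """Symmetric absmax quantization: map floats to int8 codes sharing one scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric int8 range
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]

weights = [0.12, -0.53, 0.98, -1.27, 0.0, 0.4]  # made-up values for illustration
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale:.4f}, max error={max_error:.6f}")
```

The rounding error per weight is at most half the scale, so a single large outlier (which inflates the scale) hurts every other weight. That is one reason practical methods quantize in small groups rather than per tensor.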

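Quantizing with llama.cpp

bash

# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Download the original model from Hugging Face first,
# then install the conversion script's Python dependencies
pip install -r requirements.txt

# Convert to GGUF (recent releases ship convert_hf_to_gguf.py; older ones used convert.py)
python convert_hf_to_gguf.py /path/to/model --outfile qwen7b-v1.5-f16.gguf --outtype f16

# Quantize to Q4_K_M (the binary was named ./quantize in older releases)
./build/bin/llama-quantize qwen7b-v1.5-f16.gguf qwen7b-v1.5-q4_k_m.gguf Q4_K_M

# Run (the binary was named ./main in older releases)
./build/bin/llama-cli -m qwen7b-v1.5-q4_k_m.gguf -p "Hello"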

Quantizing with AWQ

bash

# Install AutoAWQ
pip install autoawq

python

# Quantize the model
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-7B-Instruct"
quant_path = "qwen-7b-awq"

quant_config = {
    "zero_point": True,     # asymmetric quantization with a zero point
    "q_group_size": 128,    # weights share a scale within each group of 128
    "w_bit": 4,             # 4-bit weights
    "version": "GEMM"       # kernel variant
}

model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run calibration and quantize the weights
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model together with its tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

Local Deployment in Practice

Ollama - The Simplest Way to Deploy

bash

# Install Ollama
# macOS: download the .dmg installer
# Linux: curl -fsSL https://ollama.ai/install.sh | sh

# Pull models
ollama pull qwen2.5:7b
ollama pull llama3.1:70b

# Chat interactively
ollama run qwen2.5:7b

# Start the API server (default port 11434)
ollama serve

# Call the API ("stream": false returns one JSON object instead of a stream of chunks)
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "Hello",
  "stream": false
}'

python

# Python example
import requests

response = requests.post('http://localhost:11434/api/generate', json={
    "model": "qwen2.5:7b",
    "prompt": "Hello",
    "stream": False
})

print(response.json()['response'])

vLLM - High-Performance Inference Engine

bash

# Install vLLM
pip install vllm

# Start the server
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000

# Call the OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen2.5-7B-Instruct",
  "messages": [{"role": "user", "content": "Hello"}]
}'

python

# Python client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="empty"  # placeholder; the client requires some value
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello"}]
)

print(response.choices[0].message.content)

LocalAI - Multi-Model Support

bash

# Define the model (the .gguf file must already be in /models)
cat <<EOF > /models/qwen.yaml
name: qwen
parameters:
  model: qwen2.5-7b.gguf
context_size: 4096
f16: true
EOF

# Start the server with Docker
docker run -p 8080:8080 \
    -v /models:/models \
    localai/localai:latest \
    --models-path /models

# Call the OpenAI-compatible API
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "qwen",
  "messages": [{"role": "user", "content": "Hello"}]
}'

Text Generation Inference (TGI)

bash

# Run with Docker
# --quantize awq expects a checkpoint that was already quantized with AWQ
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id Qwen/Qwen2.5-7B-Instruct-AWQ \
    --quantize awq

# Call the API
curl http://localhost:8080/generate -X POST -H "Content-Type: application/json" \
    -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 128}}'

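Hardware Requirements Reference

Inference Hardware Requirements

The memory columns cover the weights only; the KV cache and activations need extra headroom.

Model size	FP16	INT8	INT4	Suggested setup
7B	14GB	7GB	4GB	RTX 3060 (12GB), INT8 or INT4
13B	26GB	13GB	8GB	RTX 4070 Ti (16GB), INT8 or INT4
30B	60GB	30GB	16GB	RTX 3090 (24GB), INT4, or with CPU offload
70B	140GB	70GB	40GB	2×RTX 3090 (INT4) or A100 (80GB)

Training Hardware Requirements

Task	Model	Method	VRAM needed	Suggested hardware
Fine-tuning	7B	QLoRA	12GB	RTX 3060
Fine-tuning	13B	QLoRA	20GB	RTX 3090
Fine-tuning	70B	QLoRA	48GB	A100 80GB
Fine-tuning	7B	Full	80GB	A100 80GB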

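Cost and Performance Optimization

Inference Performance Optimization

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 1. Use Flash Attention (requires the flash-attn package and a supported GPU)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,                # Flash Attention needs fp16 or bf16
    attn_implementation="flash_attention_2",  # the speedup grows with sequence length
    device_map="auto"
)

# 2. Use the KV cache
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    use_cache=True  # reuse cached keys/values across decoding steps (the default)
)

# 3. Batching
# Process several requests at once to raise throughput

# 4. Quantization
# Serve an INT4/INT8 quantized model

# 5. Continuous batching
# Use the continuous batching built into vLLM or TGI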
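
The KV cache trades memory for speed: it stores one key tensor and one value tensor per layer for every token in the context, so its size grows linearly with sequence length and batch size. The sketch below estimates that size. The function name and the model shape are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor 2: one tensor for keys and one for values
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical model shape, FP16 cache (2 bytes per value)
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096, batch_size=4)
print(f"{size / 1024**3:.2f} GiB")  # → 2.00 GiB
```

This is why batching more requests or allowing longer contexts can exhaust VRAM even when the weights themselves fit comfortably.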

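Cost Optimization Tips

text

┌─────────────────────────────────────────────────────────┐
│              Cost optimization strategies               │
└─────────────────────────────────────────────────────────┘

("A > B" means A is usually the cheaper choice.)

1. Model choice
   • Small models (7B/13B) > large models (70B)
   • Open-source models > commercial APIs

2. Deployment
   • Local deployment (high QPS) > cloud API (low QPS)
   • Spot instances > on-demand instances

3. Inference optimization
   • Quantization (INT4) > FP16
   • Flash Attention > standard attention
   • vLLM/TGI > plain Transformers

4. Caching
   • Semantic cache (reuse answers for equivalent questions)
   • KV cache (reuse conversation history)
   • Result cache (fixed Q&A)

5. Prompt optimization
   • Trim the prompt
   • Shorten the system prompt
   • Use shorter examples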
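
The result cache is the simplest of the caching strategies listed above. Below is a minimal sketch of one, assuming exact matching on a whitespace- and case-normalized prompt. A true semantic cache would compare embeddings instead, which is not shown. `ResultCache` and `fake_generate` are hypothetical names, and `fake_generate` stands in for a real model call.

```python
from collections import OrderedDict

class ResultCache:
    """Small LRU cache keyed on a normalized prompt."""

    def __init__(self, max_entries=1000):
        self.max_entries = max_entries
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt):
        # Collapse whitespace and case so trivially different prompts share an entry
        return " ".join(prompt.lower().split())

    def get_or_compute(self, prompt, generate):
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        self.misses += 1
        answer = generate(prompt)
        self._store[key] = answer
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry
        return answer

def fake_generate(prompt):
    """Stand-in for a real model call (hypothetical)."""
    return f"answer to: {prompt.strip()}"

cache = ResultCache(max_entries=2)
first = cache.get_or_compute("What is LoRA?", fake_generate)
second = cache.get_or_compute("  what is  lora? ", fake_generate)  # served from the cache
print(cache.hits, cache.misses)
```

Every hit skips a full generation, so even a modest hit rate on frequently repeated questions translates directly into saved GPU time.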

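Monitoring and Debugging

python

# Monitor GPU memory
import torch

print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"GPU memory reserved: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")

# Measure inference speed
# Assumes model is a loaded causal LM and inputs is the tokenizer output
# (a dict holding input_ids), already moved to the model's device
import time

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=100)
elapsed = time.time() - start
# Count the tokens actually generated; generation may stop before max_new_tokens
new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"Speed: {new_tokens / elapsed:.2f} tokens/sec")

# Log training metrics to TensorBoard
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
# Call these inside the training loop; loss, learning_rate, and step come from that loop
writer.add_scalar('Loss/train', loss, step)
writer.add_scalar('LR', learning_rate, step)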

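Summary

This article covered the full LLM fine-tuning workflow and several local deployment options:

Key Takeaways

Choosing a fine-tuning method
- QLoRA offers the best value for the resources it needs
- LoRA suits scenarios that need multiple adapters
- Full fine-tuning gives the best results but needs the most resources
Data preparation
- Data quality matters more than quantity
- Prepare at least a few hundred to a few thousand high-quality samples
- Remember to split the data into training and validation sets
Deployment options
- Ollama: the simplest, suited to personal use
- vLLM: high performance, suited to production
- TGI: Hugging Face's official solution
Optimization tips
- Use quantization to lower VRAM requirements
- Use Flash Attention to speed up inference
- Configure batching and caching sensibly

The next article will cover the principles and practice of AI agents, and show how to build AI systems that can reason and act autonomously.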

LLM Advanced: Fine-Tuning and Deployment
https://indulgeback.github.io/posts/AI%E4%B8%8ELLM/2%E3%80%81LLM%20%E8%BF%9B%E9%98%B6%E2%80%94%E2%80%94%E5%BE%AE%E8%B0%83%E4%B8%8E%E9%83%A8%E7%BD%B2

Author: LeviLiu

Published: 1/5

Updated: 3 days ago

License: CC BY-NC-SA 4.0

Attribution-NonCommercial-ShareAlike 4.0 International

Tags: LLM, Fine-tuning, LoRA, Deployment
